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With large-scale database systems, statistical analysis of data, formed by probability distributions, 
become an important task in explorative data analysis. Nevertheless, due to specific properties of 
density functions, their proper statistical treatment still represents a challenging task in functional 
data analysis. Namely, the usual L 2 metric does not fully accounts for the relative character of infor¬ 
mation, carried by density functions; instead, their geometrical features are followed by Bayes spaces 
of measures. The easiest possibility of expressing density functions in L 2 space is to use centred lo¬ 
gratio transformation, nevertheless, it results in functional data with a constant integral constraint 
that needs to be taken into account for further analysis. While theoretical background for reasonable 
analysis of density functions is already provided comprehensively by Bayes spaces themselves, pre¬ 
processing issues still need to be developed. The aim of this paper is to introduce optimal smoothing 
splines for centred logratio transformed density functions that take all their specific features into 
account and provide a concise methodology for reasonable preprocessing of raw (discretized) distri¬ 
butional observations. Theoretical developments are illustrated with a real-world data set from official 
statistics. 

Keywords: Bayes spaces, centred logratio transformation, B-spline representation, smoothing 
spline 

Classification codes : 62H99, 65D07, 65D10 


1. Introduction 

Density functions, i.e. Borel measurable, positive functions on a support / with a unit 
integral constraint, can be considered as a special case of functional data f23j. Nowadays, 
they frequently occur in practice due to large-scale database systems, where the individ¬ 
ual observations are summarized by distributions to preserve their intrinsic variability 
and enable to analyze statistically groups of individuals in meaningful way [31122] , i.e. to 
form distribution-valued variables. Nevertheless, for density functions the usual L 2 met¬ 
ric seems to be not appropriate. Not just because of the unit-integral constraint (that 
forms rather a proper representation due to probability conventions than an inherent fea¬ 
ture of densities themselves), in addition, density functions also contain information on 
relative contributions of Borel sets of real line to the overall probability on the (possibly 
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infinite) support of the corresponding random variables. As a way out, Bayes spaces with 
separable Hilbert space properties were proposed for geometric representation of density 
functions HI El HU- In order to enable standard statistical analysis of density functions, 
an isomorphic mapping from the Bayes space to the standard L 2 space is needed. Consid¬ 
ering the finite interval support case (specially / = [a, b] for real a < b), where Lebesgue 
measure as reference measure can be used HU, the simplest such isomorphism is repre¬ 
sented by the centred logratio (clr) transformation, defined for a density function f(x) 
as 


clr[/(x)] = f c (x) = ln f(x) - ^ J Inf (x)dx, (1) 

where g stands for length of the interval I, in particular g = b — a. Because clr transfor¬ 
mation forms a one-to-one mapping, the inverse clr transformation is obtained as 


clr 1 [f c (x)} 


exp (fc{x)) 

fj exp(/ c (x')) dx ’ 


( 2 ) 


the denominator is used just to achieve the unit integral constraint representation of 
the resulting density (without loss of relative information, carried by the density func¬ 
tion). Equations 0 and ([2]) come as a functional version of the popular centred logratio 
transformation of compositional data, multivariate observations that carry relative in¬ 
formation, see mm for details. Nevertheless, because obviously 


jch[f(x)]dx = 0, (3) 

this additional condition needs to be taken into account for computation and analysis 
of clr transformed density functions. Although the clr transformation may lead to com¬ 
putational problems for some of statistical methods due to constraint 0 , it is still well 
acceptable for distance-based methods or functional principal component analysis, simi¬ 
larly as for the case of compositional data, and could thus extend the currently existing 
analytical tools mm- 

However, density functions (as well as functional data in general) occur in the practice 
rarely in their continuous form. For example, the aggregation of individual observations in 
case of the distribution-valued variables leads naturally to discretized form of histogram 
data that need to be approximated (smoothed) by an appropriate function. No wonder 
that approximation of nonparametric distributions is one of basic problems of functional 
data analysis [2jj and a number of publications are devoted to cope with this issue 
[2 [26]. For the purpose of functional data analysis, smoothing H-splines turned out to 
be the most appropriate approximative tool as the resulting coefficients of basis functions 
can be directly used for further statistical analysis |24j . Nevertheless, taking the inherent 
features of densities into account, the possibility of obtaining smoothing splines (and even 
-B-splines) in case of density functions gets quite a complex problem [9l fT3l flbl fl8] . In [ 19] 
logarithmic transformation is proposed to simplify the estimation process and properties 
of the resulting splines, nevertheless, without any deeper methodological background. 

Although the methodology of Bayes spaces was successfully applied to theoretical prob¬ 
lems related with Bayesian approach to statistical analysis mm, its application to 
statistical processing of density functions is still limited due to absence of a reason¬ 
able approximation tool that would enable to proceed from functional data to smooth 
functions. Neither smoothing the original discretised densities [Sj nor using of Bernstein 
polynomials, that is proposed in [TQ, is coherent with the Bayes space methodology. 
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A natural and logical step to perform smoothing of density functions using B-splines is 
to express the data in the L 2 space of clr transformed densities and to perform the compu¬ 
tations there. Consequently, it is guaranteed that all theoretical properties of smoothing 
B-splines hold and no additional reasoning is necessary. Nevertheless, the condition Q 
needs to be taken into account for estimation of spline coefficients. The aim of the pa¬ 
per is to provide a concise solution of this problem from the point of view of numerical 
mathematics. 

The paper is organized as follows. The next section is devoted to general theory of B- 
spline representation, followed by the optimal smoothing problem, representing a trade¬ 
off between interpolation and the least squares approximation, in Section 3. In this 
section also a solution for the conditional smoothing is proposed. Section 4 continues 
with concrete example, based on a real-world data set; finally, Section 5 concludes. 


2. The B-spline representation 

In this section we recall some basics concerning the B-spline representation of splines, see 
for example mm- Let the sequence of knots A A := Ao = a < Ai <...< \ g < b = A a _|_i 
be given. In the following, Sff x [a, b] denotes the vector space of polynomial splines of 
degree k > 0, defined on a finite interval [a, b] with the sequence of knots AA. It is 
known that dim (B^ A [a, b ]) = g + k + l. Then every spline Sk{x ) E S^ x [a, 6] has a unique 
representation 


Sk (x) = b i B i +1 ( x ) • 

i=—k 


( 4 ) 


Vector b = ( b-k ,..., b g ) T is vector of B-spline coefficients of Sk(x), functions B ; fc+1 (x), 
i = —k ,..., g are B-splines of degree k and form basis in S^ x [a, bj. 

Definition 2.1 Let x = (xi, ..., x n ) T be given and let {B^ +1 (x)}®_. be B-spline 
basis of S^ x [a,b], Then 


C fc+ i(x) 


(B k _f ( Xl ) ...B^ 1 ( Xl )\ 

: ■ . : £ g,n,g+fc+l 

U-l 1 (Sn) ■ • • ( Xn )J 


( 5 ) 


is called the collocation matrix. 

Consequently, every spline from S^ x [a, b] can be written in matrix notation as 


Sfc(x) = Cfc + i(x)b. 


( 6 ) 


Now let l E {1,..., k — 1}. It is known that derivative of order l of the spline Sk(x) E 
S^ x [a, b] is a spline Sk-i(x ) E <Sj^[a, b] with the same knots. Using properties of B-splines 
the spline derivatives can be written in matrix notation as 

s fc } ( x ) = C fc+ i_;(x)b (i) , 
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where £ M 9+fc+1 1 is given by 


bW = D;L;b^ _1 ^ 


— D^L;.. DiLib 
= S,b 


and b^ 0 ) = b. Upper triangular matrix S / = D;L/... D 1 L 1 £ M s ' +fc+1 hg+k+i has full 
row rank; hereat Dj £ R9+k+l-j,g+k+l-j is diagonal matrix such that 

Dj = (k + l- j) diag (d_ k+j , ...,d g ) 


with 


and 


di 


1 

^i+k+l—j 


Vi 


-k + j, ■ ■ ■, g 


Lj := 


M i \ 


V -W 


£ ^9+k+l-j,g+k+1-j _ 


3. The optimal smoothing problem 

The optimal smoothing problem represents a compromise between the interpolation prob¬ 
lem and the least squares approximation. Let data (xj,j/i), a < Xi < b, weights Wi > 0, 
i = 1,..,, n, n > g + 1 and parameter a > 0 are given. For l £ {1,..., k — 1} the task is 
to find a spline s k (x) £ S+[a, b], which minimizes functional 


Vz('Sfc) 



n 

dx + a E m [Vi ~ s k (xi)} 2 . 
2=1 


( 7 ) 


This spline is called the smoothing spline. Further we denote x = (xi,...,x n ) T , y = 
(yi, ..., y n ) T , w = (wi,.. ., w n ) T and W = diag ( w). The functional Ji(s k ) can be 
written in a matrix form as 


Ji{ b) = h T N kl b + a 


y 


C fc+ 1 (x)b] T W[y-C fc+ 1 (x)b 


( 8 ) 


Matrix is positive semidefinite, where 


= 


j^k+1—l jz>k-\-l—l 


-k+l 


\ B_ 


k+l 


t')\ 


£ — h9~i~k+l — l 



k+l—l 
-k+l > 


p>k+l—l 

9 


(. B k g +1 - l ,B k g +1 ~ l ) 


J 
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and 


B 


k-\-l—l -r>k-\-l—l 




b 

j B k i +1 -\x)B k j +1 -\x) dx 


a 


stands for scalar product of H-splines in L 2 ([a, b\) space. Matrix M ki is positive definite, 
because B k+l ~\x) > 0, i = —k + l ,..., g are basis functions. 

Our aim is to find a spline Sfc(x) £ S^ x [a, b], which minimizes functional Ji(sk ), in 
other words, we want to find a minimum of function J;(b). From the necessary and 
sufficient condition for a minimum of this function, i.e. 


dJiih) 

db T 


= 0 , 1 }, 


we get a system of linear equations 

a -1 N w + Cj +1 (x)WC fc+1 (x)] b = Cj +1 (x)Wy. 
Solution of this system exists if and only if this system is consistent, i.e. 


( 9 ) 


CT +1 ( x )Wy G U (+ ^(xJWCh^x) ) , 


and it is given by 


b* = 


« _1 N ki + Cj +1 (x)WC fc+1 (x) C(l + i(x)Wy, 


'.T 

'k+l\ 


( 10 ) 


Here 1Z( A) denotes the range of matrix A. Matrix A is a generalized inverse of A, 
i.e. a matrix such that AA A = A, see |25| for details. If matrix A is regular, then 
A~ = A^ 1 . 

Accordingly, if the matrix of the system © is regular, then there exists a unique 
solution of ©, he. there exists a unique smoothing spline. If matrix of Q is singular 
and the system © is consistent, then there exists a class of solutions. In this case we can 
find a minimum norm solution of ©, he. a solution which has the smallest norm among 
all solutions. Such solution is unique and is given by 


b* = 


a" 1 N fcZ + Cj +1 (x)WC A:+ i(x) 


C T 


k +1 


(x)Wy; 


( 11 ) 


then we refer to a unique optimal smoothing spline. The symbol A m denotes a minimum 
norm generalized inverse, i.e. a matrix such that 

AA" A = A, (A" A) T = A" A. 

Consequently, the required smoothing spline or optimal smoothing spline are obtained 
by formula 


40*0 = E b i B i +1 (x), 

i=—k 


( 12 ) 
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where the vector b* = (b*_ k ,... ,6*) T is given in (10) and (11), respectively. 

In the case of smoothed clr transformed density functions an additional condition 

[ s k (x) dx = 0 (13) 

J a 

needs to be fulfilled. From 5-spline representation of splines it is known that the spline 

Sk{x) = Y b i B i +1 (x) 
i=—k 

is a derivative of spline 


if 


s k+1 {x) = Y c i B> i +2 ( x )i 


%— — k — 1 


bi = (k + 1)-^— C ^- Vi = -k,...,g. 


(14) 


Now let’s consider a spline s k (x) E 5^ A [a,6] such that condition (13) holds, i.e. 


fb 

o = / s k (x) dx = [sjfc +1 (x)]^ = s k+ i{Xg+i) - Sfc+i(A 0 ), 

J a 

because a = Ao, b = A s +i. With respect to the definition, properties of 5-splines and the 
mentioned additional knots we get 


0 — Sfc+i (Ag+i) Sfc + i(Ao) — Cg c— k — i, 


so that 


C-fc-l - Cg. 


(15) 


Now we find a relationship between the vector b = (b- k , ■ ■ ■, 6, 
of s k (x) and the vector c = (c_fc_i,..., c g ) T of Sfc+i(x), c E 
write 


g ) T of 5-spline coefficients 
M 9+fc+2 . Using (15) we can 


b = DKc, 


(16) 


where c = (c - k ,..., c g ) T E M 9+fc+1 , 


D = (k + 1) diag 


1 

Ai — X- k 


"’A 


g+k+l 


~ Xr 
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and 


K = 


/ 1 0 0 -l\ 

-1 1 0 --- 0 
0-1 1 • • • 0 


gk 1 , g + k +1 


\ 0 0 ••• -1 1 / 

Note that both matrices D and K are regular. The goal is to find a vector b, which 


minimizes function J;(b) given in ([8]) and satisfies condition (16) simultaneously. Using 
both relationships we can rewrite function J;(b) as 

Jl{ c) = c K D NfciDKc + a [y - C fc+ 1 (x)DKc] T W [y - C fc+ 1 (x)DKc] . 

Minimum of Ji(c) is a solution of the following system, 


a 


-l 


DK) 1 N fcZ DK + (C fc+1 (x)DK) T WC fc+1 (x)DK c = (C w (x)DK) 1 Wy. 


■\T ■ 


(17) 

For this system of linear equations we can perform similar considerations as ([9]) . Con¬ 
cretely, if the system © is consistent and its matrix is not regular, then a solution of 
(jT7j) exists, 


c* = [a" 1 (DK) 1 N,,DK + (C fc+1 (x)DK) 1 WC fc+1 (x)DKj K 1 D 1 C^(x)Wy. 

The minimum norm solution among all solutions of © is given by 


c* = [a" 1 (DK) t N u DK + (C fc+1 (x)DK) T WC fc+1 (x)DK 

Finally, the vector of .B-spline coefficients is obtained as 


K 1 D T CL_ 1 (x)Wy. 


b* = DKc*. 


It is important to note that the resulting B-spline coefficients fulfill the condition of zero 
sum, known from discrete version of the clr transformation. This is important for further 
theoretical reasoning, when functional counterparts to multivariate statistical methods 
for compositional data (like principal component analysis ISIE]) are introduced. 

The resulting splines can be used either to back-transform them to the original Bayes 
space of density functions (in order to provide their smoothed counterparts for visual¬ 
ization and interpretation purposes), or for their further statistical analysis in the clr 
space. In the next section the smoothing step is performed for real data from an Italian 
household survey. 


4. Application to real-world data 

In the following, the above theoretical considerations are applied to a real-world data set 
from official statistics. We employ data from Italian Survey of Household Income and 
Wealth (SHIW), collected by the Bank of Italy [2]. For the purpose of our study, from the 
sampled records of about 8,000 households annual income distributions in all 20 Italian 
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Table 1. Proportions of income classes in 20 Italian regions. 


region 

loc. 


proportions of income classes, pi, 

i = 1,... 

,9 


Piemonte 

N 

0.067 

0.385 

0.323 

0.134 

0.052 

0.022 

0.009 

0.005 

0.003 

Valle d’Aosta 

N 

0.042 

0.340 

0.319 

0.212 

0.042 

0.016 

0.006 

0.016 

0.006 

Lombardia 

N 

0.089 

0.275 

0.269 

0.151 

0.107 

0.056 

0.022 

0.018 

0.012 

Trent ino 

N 

0.058 

0.320 

0.279 

0.127 

0.134 

0.029 

0.041 

0.006 

0.005 

Veneto 

N 

0.103 

0.329 

0.255 

0.177 

0.081 

0.022 

0.015 

0.010 

0.007 

Friuli 

N 

0.084 

0.320 

0.232 

0.168 

0.088 

0.068 

0.028 

0.008 

0.004 

Liguria 

N 

0.081 

0.362 

0.207 

0.213 

0.081 

0.026 

0.026 

0.003 

0.002 

Emilia Romagna 

N 

0.065 

0.303 

0.275 

0.189 

0.085 

0.045 

0.017 

0.015 

0.006 

Toscana 

M 

0.043 

0.283 

0.293 

0.188 

0.105 

0.052 

0.015 

0.015 

0.007 

Umbria 

M 

0.052 

0.351 

0.337 

0.157 

0.056 

0.026 

0.015 

0.004 

0.002 

Marche 

M 

0.115 

0.401 

0.219 

0.153 

0.058 

0.032 

0.014 

0.006 

0.003 

Lazio 

M 

0.120 

0.349 

0.260 

0.150 

0.066 

0.032 

0.012 

0.007 

0.002 

Abruzzo 

S 

0.100 

0.368 

0.294 

0.144 

0.045 

0.030 

0.004 

0.010 

0.005 

Molise 

S 

0.131 

0.349 

0.277 

0.109 

0.080 

0.022 

0.022 

0.006 

0.004 

Campania 

S 

0.238 

0.485 

0.167 

0.066 

0.019 

0.016 

0.006 

0.002 

0.001 

Puglia 

S 

0.238 

0.441 

0.201 

0.068 

0.025 

0.009 

0.011 

0.003 

0.002 

Basilicata 

S 

0.246 

0.385 

0.169 

0.115 

0.038 

0.031 

0.006 

0.006 

0.003 

Calabria 

S 

0.240 

0.408 

0.209 

0.084 

0.037 

0.005 

0.010 

0.004 

0.003 

Sicilia 

S 

0.255 

0.473 

0.161 

0.053 

0.029 

0.014 

0.012 

0.002 

0.001 

Sardegna 

S 

0.167 

0.425 

0.217 

0.123 

0.044 

0.015 

0.006 

0.003 

0.002 

interval midpoints 


6574 

19591 

32608 

45625 

58641 

71658 

84675 

97692 

110709 


regions were aggregated into form of distribution-valued variables. In order to exclude 
outlying observations that would destroy the smoothing process, from non-zero income 
values in the whole data set those above 99% quantile were omitted. Consequently, for 
values from single regions the optimal number of classes was computed (according to 
the well-known Sturges rule), resulting in mean value 9.24, i.e. nine equidistant income 
classes in the range I = [65.7,1.10709 x 10 5 ] were constructed. Data Xi, i = 1,...,9 
thus correspond to midpoints of income intervals, relative frequencies (proportions of 
income classes on the overall income distribution in single regions) stand for yi values. 
Because of count character of values in income classes, also some zero values occurred 
that would make further processing using centred logratio transformation not possible. 
For this reason, their imputation using a model-based procedure was performed [20]; the 
resulting complete data are collected in Table [lj Moreover, for subsequent interpretation 
purposes, the employed Italian regions were divided into three parts according to their 
geographical location (north = N, middle = M, south and islands = S), following the 
National Statistical Institute (ISTAT) classification. 

Furthermore, we perform the discrete version of clr transformation [lj, defined as Z{ = 
hi g ( y y ‘ yg y where g denotes geometric mean of the argument values; as expected, a 

condition Ya=i z i = 0 holds. The obtained values for all employed regions are displayed 
in Table [3 

Now we can finally proceed to approximate the clr transformed proportions in the L 2 
space with optimal smoothing splines from Section 3. Let x denotes the vector of interval 
midpoints and Wi = 1 for i = 1,..., 9. For the purpose of our study, the parameter a was 
set simply to 1, other options are discussed in |6j. Furthermore, we set k = 3, l = 2 (i.e., 
cubic splines are employed) and we will consider knots AA := 0 < 30 000 < 70 000 < 
1.10709 x 10 5 for our optimal smoothing problem. The corresponding spline coefficients 
are collected in Table [3j note that the zero sum constraint is fulfilled as well. 

The resulting optimal smoothing splines can be also back-transformed to the original 
Bayes space of density functions, see Figure [l] It is easy to see from the up plot that 
the B-spline functions nicely follow the clr-transformed proportions. This is due to both 
the proper underlying geometry for the approximation procedure, provided by the L 2 
space, but also the choice of knots that avoids nuisance variability of densities. Moreover, 
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Table 2. Centred logratio transformation of income classes’ proportions in 20 Italian regions. 


region 

loc. 

clr transformation of income 

: classes’ 

proportions, Zi, 

1=1,... 

,9 

Piemonte 

N 

0.587 

2.331 

2.154 

1.271 

0.331 

-0.550 

-1.437 

-1.997 

-2.690 

Valle d’Aosta 

N 

0.015 

2.094 

2.030 

1.624 

0.015 

-0.946 

-1.919 

-0.966 

-1.946 

Lombardia 

N 

0.275 

1.405 

1.383 

0.805 

0.462 

-0.187 

-1.125 

-1.307 

-1.713 

Trentino 

N 

0.084 

1.789 

1.653 

0.873 

0.917 

-0.609 

-0.272 

-2.218 

-2.218 

Veneto 

N 

0.681 

1.843 

1.588 

1.224 

0.442 

-0.865 

-1.232 

-1.638 

-2.043 

Friuli 

N 

0.401 

1.739 

1.417 

1.095 

0.448 

0.190 

-0.697 

-1.950 

-2.643 

Liguria 

N 

0.666 

2.166 

1.606 

1.637 

0.666 

-0.473 

-0.473 

-2.662 

-3.132 

Emilia Romagna 

N 

0.140 

1.681 

1.584 

1.209 

0.405 

-0.223 

-1.204 

-1.291 

-2.303 

Toscana 

M 

-0.259 

1.619 

1.654 

1.211 

0.627 

-0.083 

-1.319 

-1.319 

-2.130 

Umbria 

M 

0.348 

2.252 

2.209 

1.447 

0.417 

-0.345 

-0.905 

-2.291 

-3.133 

Marche 

M 

0.965 

2.211 

1.607 

1.247 

0.272 

-0.326 

-1.114 

-2.031 

-2.831 

Lazio 

M 

0.982 

2.046 

1.753 

1.201 

0.386 

-0.345 

-1.301 

-1.811 

-2.910 

Abruzzo 

S 

0.856 

2.164 

1.938 

1.227 

0.057 

-0.348 

-2.308 

-1.447 

-2.140 

Molise 

S 

1.000 

1.982 

1.748 

0.818 

0.508 

-0.791 

-0.791 

-2.071 

-2.404 

Campania 

S 

2.267 

2.980 

1.914 

0.983 

-0.245 

-0.428 

-1.344 

-2.730 

-3.396 

Puglia 

S 

2.009 

2.628 

1.844 

0.756 

-0.247 

-1.258 

-1.035 

-1.952 

-2.746 

Basilicata 

S 

1.839 

2.286 

1.465 

1.082 

-0.017 

-0.240 

-1.841 

-1.933 

-2.641 

Calabria 

S 

1.988 

2.516 

1.848 

0.931 

0.105 

-1.841 

-1.148 

-2.100 

-2.298 

Sicilia 

S 

2.145 

2.763 

1.684 

0.574 

-0.014 

-0.776 

-0.931 

-2.722 

-2.722 

Sardegna 

S 

1.657 

2.591 

1.918 

1.351 

0.322 

-0.777 

-1.693 

-2.521 

-2.848 

interval midpoints 


6574 

19591 

32608 

45625 

58641 

71658 

84675 

97692 

110709 


Table 3. B-spline coefficients for clr transformed density functions of 20 Italian 
regions. 


region loc. spline coefficients for given knots, b* , i = —3,..., 2 


Piemonte 

N 

1.972 

2.650 

2.376 

-0.907 

-2.202 

-2.734 

Valle d’Aosta 

N 

-1.801 

1.192 

3.979 

-2.791 

-1.048 

-1.877 

Lombardia 

N 

-1.362 

1.609 

1.413 

-0.129 

-1.788 

-1.708 

Trentino 

N 

-2.686 

2.512 

1.128 

0.519 

-2.250 

-2.357 

Veneto 

N 

-0.660 

1.602 

2.370 

-1.079 

-1.872 

-2.067 

Friuli 

N 

-2.065 

2.531 

0.742 

0.828 

-2.205 

-2.728 

Liguria 

N 

-1.756 

2.628 

1.467 

0.487 

-2.643 

-3.299 

Emilia Romagna 

N 

-1.806 

1.609 

2.080 

-0.666 

-1.506 

-2.297 

Toscana 

M 

-2.595 

1.615 

2.038 

-0.254 

-1.823 

-2.101 

Umbria 

M 

-2.517 

2.625 

2.226 

-0.444 

-2.144 

-3.254 

Marche 

M 

-1.260 

2.816 

1.418 

-0.160 

-2.239 

-2.896 

Lazio 

M 

-0.750 

2.247 

1.889 

-0.555 

-2.015 

-2.944 

Abruzzo 

S 

0.872 

2.184 

2.542 

-1.268 

-2.280 

-2.056 

Molise 

S 

-0.827 

2.474 

1.541 

-0.448 

-2.073 

-2.508 

Campania 

S 

-0.275 

4.574 

0.753 

-0.177 

-2.878 

-3.523 

Puglia 

S 

0.372 

3.304 

1.843 

-1.926 

-1.449 

-2.856 

Basilicata 

S 

0.476 

2.974 

1.174 

-0.267 

-2.672 

-2.633 

Calabria 

S 

1.041 

2.570 

2.561 

-2.272 

-1.802 

-2.400 

Sicilia 

S 

-0.360 

4.563 

0.260 

0.097 

-2.864 

-2.874 

Sardegna 

S 

-0.124 

3.085 

1.937 

-0.529 

-3.097 

-2.903 


regional effects of income distribution are nice visible, in particular different distributions 
for northern and middle Italian regions comparing to the southern ones. 

Finally, we compare the presented approach with an alternative concept of using cubic 
smoothing splines for the original proportional data (Table [l]) as proposed in [8j, where 
the knots correspond to data points Xi, i = 1,..., 9; furthermore, parameters k = 3 and 
l = 2 were set as before. From Figure [2] with the resulting smoothed densities several 
features can be easily derived. In addition to choppy right tails as a consequence of in¬ 
appropriate choice of knots (a typical situation is shown in Figure [3j up, for the case of 
Liguria region) also negative values are reached on the left tail of curves, and for some 
of regions even on the right tail of densities as well. Although numerically this undesired 
effect (that was automatically avoided before) could be overcome by the logarithmic 
transformation of the original density functions and subsequent maximum-likelihood es¬ 
timation of the LLspline coefficients as proposed in [23] , this approach would still ignore 
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Figure 1. Optimal smoothing spline for SHIW data for all regions for clr transformed data (up) and after its 
inverse clr transformation (down). Lines are denoted according to geographical location of regions: N - dashed 
black line, M - solid black line, S - dotted grey line. 


the inherent geometrical features of densities, captured by the Bayes space methodology. 
Moreover, as a consequence of the relative scale property of density functions, captured 
by smoothing in the clr space, also data structure itself seems to be quite different for 
both approaches. For example, Valle d’Aosta with the highest maximum among income 
distributions of northern regions (see Figure [IJ down) shows an exceptional behaviour, 
when the estimation is performed in the clr space. This could be easily explained by the 
fact that in this region there is indeed a lot of wealth with respect to the rest of Italy 
and the local population gains also from a low-tax policy due to the special autonomous 
status of the Valle d’Aosta region. On the other hand, this particular behaviour is rather 
not visible, when approximating the original proportions. 


5. Conclusions 

With modern monitoring tools that enable to produce huge amounts of data, statistical 
analysis of density functions become more and more important. Although the theoretical 
background for their meaningful statistical treatment, provided by Bayes spaces, is al¬ 
ready available, for its practical applicability a wide spectrum of mathematical problems 
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Figure 2. Cubic smoothing spline for SHIW data for all regions using the original proportional data. 


needs to be solved. This paper handled one of them, smoothing of clr transformed den¬ 
sity functions. We have shown that B-splines represent an easy-to-handle tool for this 
purpose, moreover, the special case of smoothing splines provides a trade-off between the 
interpolation problem and the least squares approximation. For the real-world data ex¬ 
ample, where income distributions in 20 Italian regions were analyzed, the advantages of 
the Bayes space approach were clearly demonstrated by comparison with approximation 
of the original densities using cubic smoothing splines as proposed in [8]. In addition to 
the above advances also “dimention reduction” of the discrete input data, provided by 
the 5-spline coefficients, should be mentioned that enables to analyze statistically also 
high-dimensional functional data (resulting, e.g., from metabolomical experiments). We 
avoided to compare the results with the case of approximation of density functions with 
Bernain polynomials, proposed in [21) . as they represent a completely different concept of 
representation of densities; nevertheless, also here similar effects, resulting from ignoring 
the relative character of density functions, could be observed. 

Of course, this paper cannot be considered as a final solution, it rather opens a fur¬ 
ther variety of challenges concerning smoothing of density functions using Bayes spaces 
(concerning choice of basis functions, optimal choice of knots etc.). On the other hand, 
we are convinced that it provides a concise approach and set a clear direction for future 
research developments in the field. 
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